Can language models preserve their own alignment?
I wouldn't consider this an argument in favour; rather, it reads as a project proposal to empirically test how much models remain "good".
Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
They call this scalable oversight, but they both train and evaluate the lie-detector probes on labeled DolusChat examples. I don't see why they call it scalable.

Claude's analysis: https://claude.ai/share/d659b385-f625-4b86-9eb5-f8ce1fea33e5
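To make the objection concrete, here is a minimal sketch of the kind of lie-detector probe setup being criticised: a linear probe trained on labeled activations and evaluated on a held-out split of the same labeled distribution. The data here is synthetic and the separation direction is my own assumption for illustration; the point is that both training and evaluation never leave the labeled dataset.

```python
# Toy lie-detector probe: logistic regression on (synthetic) activations.
# Hypothetical stand-in for probes trained on labeled DolusChat examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Assumed structure: deceptive examples are shifted along one direction.
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)  # 1 = deceptive, 0 = honest
acts = rng.normal(size=(n, d_model)) + np.outer(labels, direction)

# Train and evaluate on splits of the SAME labeled distribution --
# which is why calling the result "scalable oversight" is questionable.
X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
```

High held-out accuracy here says nothing about distribution shift; it only shows the probe fits the labeled data it was given.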
Training fails to elicit subtle reasoning in current language models
I think the Sonnet 3.5 monitor / Sonnet 3.7 attacker pairing was a narrow intelligence gap. The paper itself says "as developers continue to scale RL reasoning compute, models may become more capable of subtle reasoning." I see this as a minor empirical result with unclear generalisability.
which proxies to train against.
https://www.lesswrong.com/posts/G9HdpyREaCbFJjKu5/it-is-reasonable-to-research-how-to-use-model-internals-in?commentId=krg2jzDxXhei9vNLj

See also Daniel Kokotajlo's comment about preserving at least one output stream that isn't optimised against (this could be the activations, while doing CoT + output monitoring).
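A toy sketch of that idea (my illustration, not the commenter's design): the training reward only sees the CoT and output monitors, while an activation-based monitor is held out purely for evaluation, so the policy feels no optimisation pressure against it. All monitor names and the episode fields are hypothetical.

```python
# Hypothetical setup: two monitored streams feed the reward; one is held out.
def cot_monitor(ep):        return ep["cot_suspicion"]
def output_monitor(ep):     return ep["output_suspicion"]
def activation_monitor(ep): return ep["activation_suspicion"]  # held out

def reward(ep):
    # Training signal: ONLY the streams we deliberately optimise against.
    return ep["task_score"] - (cot_monitor(ep) + output_monitor(ep))

episode = {"task_score": 1.0, "cot_suspicion": 0.1,
           "output_suspicion": 0.0, "activation_suspicion": 0.7}

print(reward(episode))              # what the policy is trained on
print(activation_monitor(episode))  # evaluation-only, unoptimised stream
```

Because `activation_monitor` never enters `reward`, a policy that learns to look clean to the CoT and output monitors can still trip the held-out stream.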